Improved data-driven likelihood factorizations for transcript abundance estimation
نویسندگان
چکیده
Motivation Many methods for transcript-level abundance estimation reduce the computational burden associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood function they optimize. This leads to considerably faster convergence of the optimization procedure, since each round of e.g. the EM algorithm, can execute much more quickly. However, these approximate factorizations of the likelihood function simplify calculations at the expense of discarding certain information that can be useful for accurate transcript abundance estimation. Results We demonstrate that model simplifications (i.e. factorizations of the likelihood function) adopted by certain abundance estimation methods can lead to a diminished ability to accurately estimate the abundances of highly related transcripts. In particular, considering factorizations based on transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can achieve accuracy nearly indistinguishable from methods that consider the complete (i.e. per-fragment) likelihood, while retaining the computational efficiently of the compatibility-based factorizations. Availability and implementation Our data-driven factorizations are incorporated into a branch of the Salmon transcript quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations . Contact [email protected]. Supplementary information Supplementary data are available at Bioinformatics online.
منابع مشابه
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
FlipFlop implements a fast method for de novo transcript discovery and abundance estimation from RNA-Seq data. It differs from Cufflinks by simultaneously performing the transcript and quantitation tasks using a penalized maximum likelihood approach, which leads to improved precision/recall. Other softwares taking this approach have an exponential complexity in the number of exons in the gene. ...
متن کاملImproved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
متن کاملFast and accurate quantification and differential analysis of transcriptomes
Fast and accurate quantification and differential analysis of transcriptomes by Harold Joseph Pimentel Doctor of Philosophy in Computer science University of California, Berkeley Professor Lior Pachter, Chair As access to DNA sequencing has become ubiquitous to scientists, the use of sequencers has expanded from determining the genomes of individuals to performing molecular probing assays. Thes...
متن کاملCapture-recapture abundance estimation using a semi-complete data likelihood approach
Capture-recapture data are often collected when abundance estimation is of interest. In this manuscript we focus on abundance estimation of closed populations. In the presence of unobserved individual heterogeneity, specified on a continuous scale for the capture probabilities, the likelihood is not generally available in closed form, but expressible only as an analytically intractable integral...
متن کاملMultiscale Likelihood Analysis and Complexity Penalized Estimation
We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of suffic...
متن کامل